The data set is the final-project data set provided by Udacity for its Intro to Machine Learning course.

It combines financial data with variables derived from Enron email data.

There are 22 variables and 144 observations, 18 of which are flagged as persons of interest (POI).

Descriptive Statistics

##                   X           bonus         deferral_payments
##  ALLEN PHILLIP K   :  1   Min.   :  70000   Min.   :-102500  
##  BADUM JAMES P     :  1   1st Qu.: 425000   1st Qu.:  79644  
##  BANNANTINE JAMES M:  1   Median : 750000   Median : 221064  
##  BAXTER JOHN C     :  1   Mean   :1201773   Mean   : 841603  
##  BAY FRANKLIN R    :  1   3rd Qu.:1200000   3rd Qu.: 867211  
##  BAZELIDES PHILIP J:  1   Max.   :8000000   Max.   :6426990  
##  (Other)           :138   NA's   :63        NA's   :106      
##  deferred_income    director_fees                    email_address
##  Min.   :-3504386   Min.   :  3285                          : 33  
##  1st Qu.: -611209   1st Qu.: 83674   a..martin@enron.com    :  1  
##  Median : -151927   Median :106164   adam.umanoff@enron.com :  1  
##  Mean   : -581050   Mean   : 89823   andrew.fastow@enron.com:  1  
##  3rd Qu.:  -37926   3rd Qu.:112815   ben.glisan@enron.com   :  1  
##  Max.   :    -833   Max.   :137864   bill.cordes@enron.com  :  1  
##  NA's   :96         NA's   :128      (Other)                :106  
##  exercised_stock_options    expenses      from_messages     
##  Min.   :    3285        Min.   :   148   Min.   :   12.00  
##  1st Qu.:  506765        1st Qu.: 22479   1st Qu.:   22.75  
##  Median : 1297049        Median : 46548   Median :   41.00  
##  Mean   : 2959559        Mean   : 54192   Mean   :  608.79  
##  3rd Qu.: 2542813        3rd Qu.: 78408   3rd Qu.:  145.50  
##  Max.   :34348384        Max.   :228763   Max.   :14368.00  
##  NA's   :43              NA's   :50       NA's   :58        
##  from_poi_to_this_person from_this_person_to_poi loan_advances     
##  Min.   :  0.00          Min.   :  0.00          Min.   :  400000  
##  1st Qu.: 10.00          1st Qu.:  1.00          1st Qu.: 1200000  
##  Median : 35.00          Median :  8.00          Median : 2000000  
##  Mean   : 64.90          Mean   : 41.23          Mean   :27975000  
##  3rd Qu.: 72.25          3rd Qu.: 24.75          3rd Qu.:41762500  
##  Max.   :528.00          Max.   :609.00          Max.   :81525000  
##  NA's   :58              NA's   :58              NA's   :141       
##  long_term_incentive     other             poi      restricted_stock  
##  Min.   :  69223     Min.   :       2   False:126   Min.   :-2604490  
##  1st Qu.: 275000     1st Qu.:    1203   True : 18   1st Qu.:  252055  
##  Median : 422158     Median :   51587               Median :  441096  
##  Mean   : 746491     Mean   :  466411               Mean   : 1147424  
##  3rd Qu.: 831809     3rd Qu.:  331983               3rd Qu.:  985032  
##  Max.   :5145434     Max.   :10359729               Max.   :14761694  
##  NA's   :79          NA's   :53                     NA's   :35        
##  restricted_stock_deferred     salary        shared_receipt_with_poi
##  Min.   :-1787380          Min.   :    477   Min.   :   2.0         
##  1st Qu.: -329825          1st Qu.: 211802   1st Qu.: 249.8         
##  Median : -140264          Median : 258741   Median : 740.5         
##  Mean   :  621893          Mean   : 284088   Mean   :1176.5         
##  3rd Qu.:  -72419          3rd Qu.: 308606   3rd Qu.:1888.2         
##  Max.   :15456290          Max.   :1111258   Max.   :5521.0         
##  NA's   :127               NA's   :50        NA's   :58             
##   to_messages      total_payments      total_stock_value 
##  Min.   :   57.0   Min.   :      148   Min.   :  -44093  
##  1st Qu.:  541.2   1st Qu.:   396934   1st Qu.:  494136  
##  Median : 1211.0   Median :  1101393   Median : 1095040  
##  Mean   : 2073.9   Mean   :  2641806   Mean   : 3352073  
##  3rd Qu.: 2634.8   3rd Qu.:  2087530   3rd Qu.: 2606763  
##  Max.   :15149.0   Max.   :103559793   Max.   :49110078  
##  NA's   :58        NA's   :21          NA's   :19        
##                  name    
##  ALLEN PHILLIP K   :  1  
##  BADUM JAMES P     :  1  
##  BANNANTINE JAMES M:  1  
##  BAXTER JOHN C     :  1  
##  BAY FRANKLIN R    :  1  
##  BAZELIDES PHILIP J:  1  
##  (Other)           :138

Univariate Analysis

Histograms of the financial variables show a variety of distributions. Many of the variables have missing values, which are dropped before plotting.

None of the variables is normally distributed; nearly all are skewed.
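The skew can also be checked numerically. The following is a minimal Python sketch, not the code used for the analysis: the helper names `drop_missing` and `sample_skewness` and the values in `toy_bonus` are illustrative assumptions. It drops missing entries and then computes the moment-based skewness, which is positive for a long right tail.

```python
import math

def drop_missing(values):
    """Return only the observed values (None and NaN count as missing)."""
    return [v for v in values
            if v is not None and not (isinstance(v, float) and math.isnan(v))]

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 of the observed values."""
    xs = drop_missing(values)
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical values, not taken from the data set.
toy_bonus = [100_000, 200_000, None, 250_000, 300_000, float("nan"), 5_000_000]
observed = drop_missing(toy_bonus)
skew = sample_skewness(toy_bonus)  # positive: one large value stretches the right tail
```

The summary table gives the same hint without any extra computation: for bonus the mean (1201773) is well above the median (750000), which is typical of a right-skewed variable.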

To better represent the distribution of values, either a log10 or a square-root transform is applied. This gives a clearer view of the over-dispersed variables.
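The two transforms can be sketched in Python as follows. The summary table shows that several variables contain zero or negative values, for which log10 is undefined; the text does not say how these were handled, so mapping them to missing here is an assumption, as are the function names and the toy values.

```python
import math

def log10_transform(values):
    """log10 of each strictly positive value; anything else becomes None."""
    return [math.log10(v) if v is not None and v > 0 else None for v in values]

def sqrt_transform(values):
    """Square root of each non-negative value; anything else becomes None."""
    return [math.sqrt(v) if v is not None and v >= 0 else None for v in values]

# Hypothetical values spanning several orders of magnitude.
toy_values = [100, 10_000, 1_000_000, 0, -2_500, None]
logged = log10_transform(toy_values)  # equal ratios become equal steps
rooted = sqrt_transform(toy_values)   # milder compression, and zero is allowed
```

The log10 transform suits variables that span several orders of magnitude, while the square root is gentler and keeps zeros.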

The variables derived from emails are also highly skewed.

The transformations again make the distributions easier to read, and several variables look closer to a normal distribution after a log10 transformation.

For the financial variables, frequency polygons are used to compare persons of interest (POI) with everyone else.
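A frequency polygon is a histogram drawn as a line through the bin counts, which makes it easy to overlay one line per group. The Python sketch below computes the points of such lines for POI and non-POI groups; the function name, the bin settings and the records are all illustrative assumptions rather than values from the data set.

```python
def frequency_polygon(values, n_bins, lo, hi):
    """Return (bin_midpoints, counts) for the values that lie in [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v is None or v < lo or v > hi:
            continue  # skip missing and out-of-range values
        idx = min(int((v - lo) / width), n_bins - 1)  # hi goes in the last bin
        counts[idx] += 1
    mids = [lo + (i + 0.5) * width for i in range(n_bins)]
    return mids, counts

# Hypothetical (value, is_poi) records.
records = [(1.0, False), (2.0, False), (2.5, False), (7.0, True),
           (8.5, True), (None, False), (9.0, False)]

# Split the values by POI flag, then bin each group on the same grid.
by_group = {}
for value, is_poi in records:
    by_group.setdefault(is_poi, []).append(value)

polygons = {group: frequency_polygon(vals, n_bins=2, lo=0.0, hi=10.0)
            for group, vals in by_group.items()}
```

Using the same bin grid for both groups keeps the two lines directly comparable. With only 18 POIs against 126 non-POIs, raw counts will differ greatly in height, so comparing the shapes rather than the heights is usually more informative.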

For a few of the variables, it is difficult to separate any trend between POIs and non-POIs.

Looking at POIs within the email data highlights some promising variables. shared_receipt_with_poi shows a spike for POIs, but a number of non-POIs overlap it as well.

from_poi_to_this_person suggests that people who receive more emails from POIs are more likely to be POIs themselves.

from_this_person_to_poi is harder to interpret: the POIs (poi = True) are scattered throughout the distribution.

The pair plot gives a quick way to spot highly correlated variables.
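The numbers behind a pair plot can be approximated with a pairwise Pearson correlation. The Python sketch below uses only the rows where both variables are observed, which matters here because most variables have many missing values. The function name, the column names `a`, `b`, `c`, the toy values and the 0.9 cutoff are all assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical columns with some missing entries.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, None],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [4.0, 1.0, None, 3.0, 0.0],
}

# Correlation for every unordered pair of columns.
names = sorted(toy)
corr = {(p, q): pearson_r(toy[p], toy[q])
        for i, p in enumerate(names) for q in names[i + 1:]}

# Flag strongly correlated pairs (0.9 is an arbitrary illustrative cutoff).
strong = [pair for pair, r in corr.items() if abs(r) >= 0.9]
```

Because each pair is computed on its own set of complete rows, two correlations in the same table may rest on different numbers of observations.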

loan_advances is removed because it has too few data points: 141 of its 144 values are missing.
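The same rule can be applied to every variable at once by counting non-missing values per column. This is a minimal Python sketch; the text gives no explicit cutoff, so the `min_points` threshold, the function names and the toy values are assumptions.

```python
def non_missing_count(values):
    """Number of entries that are not None."""
    return sum(v is not None for v in values)

def drop_sparse_columns(table, min_points):
    """Keep only the columns with at least min_points observed values."""
    return {name: col for name, col in table.items()
            if non_missing_count(col) >= min_points}

# Hypothetical values; only the column names mirror the data set.
toy_table = {
    "salary": [1, 2, None, 4, 5],
    "loan_advances": [None, None, 7, None, None],
}
kept = drop_sparse_columns(toy_table, min_points=3)  # loan_advances is dropped
```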

The plot investigates the correlation between to_messages and shared_receipt_with_poi.

The plot investigates the correlation between from_messages and from_poi_to_this_person.

The plot uses ratios of email variables to highlight how persons of interest differ from non-persons of interest.

The plot uses the ratios of total payments and bonus to salary to try to separate out POIs.
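Ratio features of this kind can be sketched in Python as follows. The text does not state which email ratios were plotted, so dividing from_this_person_to_poi by from_messages is an assumption, as are the helper `safe_ratio`, the new field names and the two example records. Missing values and zero denominators yield a missing ratio rather than an error.

```python
def safe_ratio(numerator, denominator):
    """numerator / denominator, or None if either is missing or the denominator is 0."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

# Hypothetical people; the values are not taken from the data set.
people = [
    {"name": "person_a", "from_this_person_to_poi": 30, "from_messages": 100,
     "bonus": 600_000, "total_payments": 1_000_000, "salary": 200_000},
    {"name": "person_b", "from_this_person_to_poi": None, "from_messages": None,
     "bonus": None, "total_payments": 500_000, "salary": 250_000},
]

for p in people:
    # Share of this person's sent mail that went to a POI (assumed ratio).
    p["to_poi_ratio"] = safe_ratio(p["from_this_person_to_poi"], p["from_messages"])
    # Financial ratios against salary, as described in the text.
    p["bonus_to_salary"] = safe_ratio(p["bonus"], p["salary"])
    p["payments_to_salary"] = safe_ratio(p["total_payments"], p["salary"])
```

Ratios put heavy and light email users, or high and low earners, on a common scale, which is why they can separate groups that the raw counts do not.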